HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning

Liu, Yujian, Ji, Jiabao, Zhang, Yang, Guo, Wenbo, Jaakkola, Tommi, Chang, Shiyu

arXiv.org Artificial Intelligence

Existing LLM-based automatic test generation methods mainly produce input and expected output pairs to characterize the intended behavior of correct programs. Although straightforward, these methods have limited diversity in generated tests and cannot provide enough debugging information. We propose HarnessLLM, a two-stage training pipeline that enables LLMs to write harness code for testing. Particularly, LLMs generate code that synthesizes inputs and validates the observed outputs, allowing complex test cases and flexible output validation such as invariant checking. To achieve this, we train LLMs with SFT followed by RLVR with a customized reward design. Experiments show that HarnessLLM outperforms input-output-based testing in bug finding and testing strategy diversity. HarnessLLM further improves code generation performance through test-time scaling, using our generated test cases for inference-phase validation. Our code is available at https://github.com/UCSB-NLP-Chang/HarnessLLM.git.
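
To make the contrast with input-output pairs concrete, below is a minimal Python sketch of a harness-style test in the spirit described above: it synthesizes random inputs and checks invariants that any correct output must satisfy, rather than comparing against a single hard-coded expected output. The target program and invariants are hypothetical illustrations, not the paper's code.

    # Hypothetical harness-style test: synthesize inputs, then validate observed
    # outputs via invariants instead of a hard-coded expected output.
    import random

    def target_sort(xs):
        # Stand-in for the program under test.
        return sorted(xs)

    def harness(program, trials=1000):
        for _ in range(trials):
            xs = [random.randint(-100, 100) for _ in range(random.randint(1, 20))]
            out = program(list(xs))
            # Invariant 1: the output is ordered.
            assert all(a <= b for a, b in zip(out, out[1:])), f"not sorted: {out}"
            # Invariant 2: the output is a permutation of the input.
            assert sorted(xs) == sorted(out), f"not a permutation of {xs}"
        return "no bug found"

    print(harness(target_sort))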


Automated Snippet-Alignment Data Augmentation for Code Translation

Zhang, Zhiming, Zhu, Qingfu, Luo, Xianzhen, Wang, Yixuan, Li, Bohan, Che, Wanxiang

arXiv.org Artificial Intelligence

Code translation aims to translate the code from its source language to the target language and is used in various software development scenarios. Recent developments in Large Language Models (LLMs) have showcased their capabilities in code translation, and parallel corpora play a crucial role in training models for code translation. Parallel corpora can be categorized into program-alignment (PA) and snippet-alignment (SA) data. Although PA data has complete context and is suitable for semantic alignment learning, it may not provide adequate fine-grained training signals due to its extended length, while the brevity of SA data enables more fine-grained alignment learning. Due to limited parallel corpora, researchers explore several augmentation methods for code translation. Previous studies mainly focus on augmenting PA data. In this paper, we propose a data augmentation method that leverages LLMs to generate SA data automatically. To fully leverage both PA data and SA data, we explore a simple yet effective two-stage training strategy, which consistently enhances model performance compared to fine-tuning solely on PA data. Experiments on TransCoder-test demonstrate that our augmented SA data combined with the two-stage training approach yields consistent improvements over the baseline, achieving a maximum gain of 3.78% on pass@k.
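
As a rough illustration (and only an assumption, since the abstract does not spell out the exact procedure), snippet-aligned pairs could be derived by asking an LLM to split a program-aligned pair into aligned snippets, after which the two-stage schedule trains on one granularity and then the other. A minimal Python sketch with a placeholder LLM call:

    # Hypothetical sketch: derive snippet-aligned (SA) pairs from a program-aligned (PA)
    # pair via an LLM, then lay out a two-stage fine-tuning schedule.
    import json

    def call_llm(prompt):
        # Canned response for demonstration; replace with a real LLM client.
        return json.dumps([["int add(int a,int b){return a+b;}", "def add(a, b): return a + b"]])

    def pa_to_sa(src_program, tgt_program):
        prompt = ("Split these parallel programs into semantically aligned snippet pairs, "
                  "returned as a JSON list of [source_snippet, target_snippet] pairs.\n"
                  f"Source:\n{src_program}\nTarget:\n{tgt_program}")
        return [tuple(pair) for pair in json.loads(call_llm(prompt))]

    def two_stage_schedule(pa_pairs, sa_pairs):
        # Stage order is an assumption; the point is that each stage sees one granularity.
        return [("stage_1_program_level", pa_pairs), ("stage_2_snippet_level", sa_pairs)]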


Fuzzing: Randomness? Reasoning! Efficient Directed Fuzzing via Large Language Models

Feng, Xiaotao, Zhu, Xiaogang, Hu, Kun, Wang, Jincheng, Cao, Yingjie, Gong, Guang, Pan, Jianfeng

arXiv.org Artificial Intelligence

Fuzzing is highly effective in detecting bugs due to the key contribution of randomness. However, randomness significantly reduces the efficiency of fuzzing, causing it to take days or weeks to expose bugs. Even though directed fuzzing reduces randomness by guiding fuzzing towards target buggy locations, the dilemma of randomness still challenges directed fuzzers. Two critical components, which are seeds and mutators, contain randomness and are closely tied to the conditions required for triggering bugs. Therefore, to address the challenge of randomness, we propose to use large language models (LLMs) to remove the randomness in seeds and reduce the randomness in mutators. With their strong reasoning and code generation capabilities, LLMs can be used to generate reachable seeds that target pre-determined locations and to construct bug-specific mutators tailored for specific bugs. We propose RandLuzz, which integrates LLMs and directed fuzzing, to improve the quality of seeds and mutators, resulting in efficient bug exposure. RandLuzz analyzes the function call chain or functionality to guide LLMs in generating reachable seeds. To construct bug-specific mutators, RandLuzz uses LLMs to perform bug analysis, obtaining information such as bug causes and mutation suggestions, which further help generate code that performs bug-specific mutations. We evaluate RandLuzz by comparing it with four state-of-the-art directed fuzzers, AFLGo, Beacon, WindRanger, and SelectFuzz. With RandLuzz-generated seeds, the fuzzers achieve an average speedup ranging from 2.1× to 4.8× compared to using widely-used initial seeds. Additionally, when evaluated on individual bugs, RandLuzz achieves up to a 2.7× speedup compared to the second-fastest exposure. On 8 bugs, RandLuzz can even expose them within 60 seconds.
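
The contrast between a generic mutator and a bug-specific one can be sketched in a few lines of Python. The input format, offsets, and bug condition below are invented for illustration; the point is that the mutation encodes a concrete suggestion from bug analysis instead of relying on random byte flips.

    # Hypothetical mutators: a conventional random mutation versus a "bug-specific" one
    # that encodes a mutation suggestion (here: inflate a length field to probe an
    # out-of-bounds write).
    import random

    def random_mutator(seed: bytes) -> bytes:
        # Conventional mutation: flip a random byte anywhere in the input.
        buf = bytearray(seed)
        buf[random.randrange(len(buf))] ^= 0xFF
        return bytes(buf)

    def bug_specific_mutator(seed: bytes, length_field_offset: int = 4) -> bytes:
        # Targeted mutation: overwrite the 4-byte little-endian length field with an
        # oversized value, the condition a (hypothetical) bug analysis suggested.
        buf = bytearray(seed)
        buf[length_field_offset:length_field_offset + 4] = (0xFFFFFFF0).to_bytes(4, "little")
        return bytes(buf)

    seed = b"HDR1" + (16).to_bytes(4, "little") + b"payloadpayload__"
    print(random_mutator(seed).hex())
    print(bug_specific_mutator(seed).hex())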


$T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models

Liu, Quanming, Bu, Xupeng, Yan, Zhichao, Li, Ru

arXiv.org Artificial Intelligence

Automatic Program Repair (APR) is a core technology in software development and maintenance, which aims to enable automated defect repair with minimal human intervention. In recent years, the substantial advancements in Large Language Models (LLMs) and Chain-of-Thought (CoT) techniques have significantly enhanced the reasoning capabilities of these models. However, due to the complex logic and multi-step reasoning required, the application of CoT techniques in the APR domain remains insufficient. This study systematically evaluates the performance of several common CoT techniques in APR tasks and proposes an innovative framework, $T^3$, which integrates the powerful reasoning capabilities of LLMs with tree search, effectively improving the precision of generating candidate repair solutions. Furthermore, $T^3$ provides valuable guidance for optimizing sample selection and repair strategies in APR tasks, establishing a robust framework for achieving efficient automated debugging.
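
The abstract does not detail the tree-search procedure, but the general shape of combining an LLM proposer with tree search over candidate repairs can be sketched as a best-first search scored by passing tests. The proposer and tests below are placeholders, not the paper's algorithm.

    # Hypothetical best-first tree search over candidate patches: children are
    # LLM-proposed refinements and nodes are expanded by how many tests they pass.
    import heapq
    import itertools

    def propose_children(patch):
        # Placeholder for an LLM call that refines a candidate patch.
        return [patch + f" /* refinement {i} */" for i in range(2)]

    def score(patch, tests):
        return sum(1 for t in tests if t(patch))

    def tree_search_repair(initial_patch, tests, budget=20):
        tie = itertools.count()                       # tie-breaker for the heap
        best, best_score = initial_patch, score(initial_patch, tests)
        frontier = [(-best_score, next(tie), initial_patch)]
        for _ in range(budget):
            if not frontier:
                break
            neg_s, _, patch = heapq.heappop(frontier)
            if -neg_s == len(tests):                  # all tests pass: done
                return patch
            for child in propose_children(patch):
                s = score(child, tests)
                if s > best_score:
                    best, best_score = child, s
                heapq.heappush(frontier, (-s, next(tie), child))
        return best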


Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

Alon, Yoav, David, Cristina

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture where a Reinforcement Learning (RL) Agent guides an LLM's exploration of the solution space: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics that were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, i.e., the binary equivalence classification, and the intermediate reasoning steps. Our approach compares favorably against CoT and ToT.
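
A minimal sketch of such a control loop, with placeholder proposer and value functions (both hypothetical), is a depth-limited search in which the agent ranks LLM-proposed next steps and the explicit stack makes backtracking to earlier states possible:

    # Hypothetical control loop: an LLM proposes immediate next steps, an agent scores
    # them with a domain-specific metric, and popping an older stack entry amounts to
    # backtracking to an alternative path.
    def propose_steps(state):
        # Placeholder for the LLM generating candidate next reasoning steps.
        return [state + f".{i}" for i in range(3)]

    def agent_value(state):
        # Placeholder for the RL agent's domain-specific value estimate.
        return -len(state)

    def guided_search(start, is_goal, depth_limit=5):
        stack = [(start, 0)]
        while stack:
            state, depth = stack.pop()
            if is_goal(state):
                return state
            if depth == depth_limit:
                continue
            # Best-scoring candidate is pushed last, so it is explored first.
            for step in sorted(propose_steps(state), key=agent_value):
                stack.append((step, depth + 1))
        return None

    print(guided_search("root", is_goal=lambda s: s.endswith(".2.2")))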


TRANSAGENT: An LLM-Based Multi-Agent System for Code Translation

Yuan, Zhiqiang, Chen, Weitong, Wang, Hanlin, Yu, Kai, Peng, Xin, Lou, Yiling

arXiv.org Artificial Intelligence

Code translation converts code from one programming language to another while maintaining its original functionality, which is crucial for software migration, system refactoring, and cross-platform development. Traditional rule-based methods rely on manually written rules, which can be time-consuming and often result in less readable code. To overcome this, learning-based methods have been developed, leveraging parallel data to train models for automated code translation. More recently, the advance of Large Language Models (LLMs) has further boosted learning-based code translation. Although promising, LLM-translated programs still suffer from diverse quality issues (e.g., syntax errors and semantic errors). In particular, it can be challenging for LLMs to self-debug these errors when simply provided with the corresponding error messages. In this work, we propose a novel LLM-based multi-agent system, TRANSAGENT, which enhances LLM-based code translation by fixing syntax errors and semantic errors through the synergy of four LLM-based agents: Initial Code Translator, Syntax Error Fixer, Code Aligner, and Semantic Error Fixer. The main insight of TRANSAGENT is to first localize the erroneous code block in the target program based on the execution alignment between the target and source program, which narrows down the fixing space and thus lowers the fixing difficulty. To evaluate TRANSAGENT, we first construct a new benchmark from recent programming tasks to mitigate the potential data leakage issue. On our benchmark, TRANSAGENT outperforms the latest LLM-based code translation technique UniTrans in both translation effectiveness and efficiency; additionally, our evaluation on different LLMs shows the generalization of TRANSAGENT, and our ablation study shows the contribution of each agent.
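
The described control flow can be sketched as a small orchestration loop. The agent internals below are deliberately left as stubs and are not the paper's implementation; what the sketch shows is the ordering: translate, repair syntax, localize the faulty block by aligning source and target execution, then repair semantics only within the narrowed region.

    # Hypothetical orchestration of a translate-then-fix pipeline. Each agent is a stub.
    def initial_translate(src):
        raise NotImplementedError("Initial Code Translator (LLM) goes here")

    def fix_syntax(tgt):
        raise NotImplementedError("Syntax Error Fixer (LLM + compiler feedback) goes here")

    def align_and_localize(src, tgt, failing_test):
        raise NotImplementedError("Code Aligner: map the failing execution to a target block")

    def fix_semantics(tgt, region):
        raise NotImplementedError("Semantic Error Fixer: repair only the localized block")

    def translate(src, tests, max_rounds=3):
        tgt = fix_syntax(initial_translate(src))
        for _ in range(max_rounds):
            failing = [t for t in tests if not t(tgt)]
            if not failing:
                break
            region = align_and_localize(src, tgt, failing[0])   # narrows the fixing space
            tgt = fix_semantics(tgt, region)
        return tgt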


Validating LLM-Generated Programs with Metamorphic Prompt Testing

Wang, Xiaoyin, Zhu, Dakai

arXiv.org Artificial Intelligence

The latest paradigm shift in software development brings in the innovation and automation afforded by Large Language Models (LLMs), showcased by Generative Pre-trained Transformer (GPT), which has shown a remarkable capacity to generate code autonomously, significantly reducing the manual effort required for various programming tasks. Although the potential benefits of LLM-generated code are vast, most notably in efficiency and rapid prototyping, complex and multifaceted challenges arise as LLMs become increasingly integrated into the software development lifecycle and hence the supply chain, because the code generated by these models raises profound questions about quality and correctness. Research is required to comprehensively explore these critical concerns surrounding LLM-generated code. In this paper, we propose a novel solution called metamorphic prompt testing to address these challenges. Our intuitive observation is that intrinsic consistency always exists among correct code pieces but may not exist among flawed code pieces, so we can detect flaws in the code by detecting inconsistencies. Therefore, we can paraphrase a given prompt into multiple prompts, ask the LLM to generate multiple versions of the code, and then check through cross-validation whether the expected semantic relations still hold among the acquired programs. Our evaluation on HumanEval shows that metamorphic prompt testing is able to detect 75 percent of the erroneous programs generated by GPT-4, with a false positive rate of 8.6 percent.
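
A minimal Python sketch of the cross-validation idea: take several program versions (in practice, generated from paraphrased prompts), run them on shared inputs, and flag any candidate whose outputs disagree with the majority. The toy candidates and the majority-vote check below are illustrative assumptions, not the paper's exact procedure.

    # Hypothetical cross-validation of multiple generated versions of the same program.
    from collections import Counter

    def cross_validate(candidates, test_inputs):
        # candidates: callables compiled from LLM outputs for paraphrased prompts.
        flagged = set()
        for x in test_inputs:
            outputs = [f(x) for f in candidates]
            majority, _ = Counter(outputs).most_common(1)[0]
            flagged |= {i for i, out in enumerate(outputs) if out != majority}
        return flagged   # indices of candidates inconsistent with the rest

    # Toy example: two consistent implementations and one flawed one.
    candidates = [lambda n: n * n, lambda n: n ** 2, lambda n: n * 2]
    print(cross_validate(candidates, test_inputs=range(3, 8)))   # -> {2}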


When Fuzzing Meets LLMs: Challenges and Opportunities

Jiang, Yu, Liang, Jie, Ma, Fuchen, Chen, Yuanliang, Zhou, Chijin, Shen, Yuheng, Wu, Zhiyong, Fu, Jingzhou, Wang, Mingzhe, Li, ShanShan, Zhang, Quan

arXiv.org Artificial Intelligence

Fuzzing, a widely used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identify five major challenges of LLM-assisted fuzzing. To support our findings, we revisit the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a remedy, we propose actionable recommendations to help improve the application of LLMs in fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges.


Augmenting Greybox Fuzzing with Generative AI

Hu, Jie, Zhang, Qian, Yin, Heng

arXiv.org Artificial Intelligence

In recent years, fuzz testing has emerged as an effective technique for testing software systems. For example, fuzz testing has been remarkably successful in uncovering critical security bugs in applications such as the Chrome web browser [1] and the SQLite database [11]. Generally, fuzz testing runs a program with seed inputs, mutates previous inputs to improve a given guidance metric such as branch coverage, and repeats this cycle of input mutation and target program execution. During the fuzzing process, the target program is executed with a large number of generated test cases while its runtime behavior is monitored to find vulnerabilities. For that, it is essential to generate test cases that effectively cover a wide range of execution paths and program behaviors. This comprehensive coverage enables thorough exploration of the program's functionality and helps uncover potential vulnerabilities or issues. The simplicity of fuzzing has made it a de-facto testing procedure for large-scale software systems; however, its effectiveness rests on an inherent yet often overlooked assumption: a set of arbitrary input mutations is likely to yield meaningful inputs. In fact, our extensive experience suggests that this assumption often does not hold for most software systems that take highly structured data as inputs.
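
For readers unfamiliar with the loop being described, here is a toy, self-contained Python sketch of coverage-guided greybox fuzzing: run inputs, mutate corpus members, keep any input that reaches new coverage, and report a crash. The "instrumented" target, its branch set, and its bug are invented for illustration only.

    # Toy coverage-guided fuzzing loop (illustrative; not real instrumentation).
    import random

    def target(data: bytes) -> set:
        # Stand-in instrumented target: returns the set of "branches" it exercised,
        # and crashes on a specific structured input.
        branches = {"entry"}
        if data.startswith(b"FUZZ"):
            branches.add("magic")
            if len(data) > 8 and data[8] == 0xFF:
                raise RuntimeError("crash: out-of-bounds write")
        return branches

    def mutate(data: bytes) -> bytes:
        buf = bytearray(data or b"\x00")
        buf[random.randrange(len(buf))] = random.randrange(256)
        return bytes(buf)

    def fuzz(seeds, iterations=50_000):
        corpus, covered = list(seeds), set()
        for _ in range(iterations):
            candidate = mutate(random.choice(corpus))
            try:
                branches = target(candidate)
            except RuntimeError as e:
                return candidate, str(e)        # crashing input found
            if not branches <= covered:         # new coverage: keep the input
                covered |= branches
                corpus.append(candidate)
        return None, "no crash found"

    print(fuzz([b"FUZZAAAAAAAA"]))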


Autonomous Intelligent Software Development

Matties, Mark Alan

arXiv.org Artificial Intelligence

We present an overview of the design and a first proof-of-concept implementation of AIDA, an autonomous intelligent developer agent that develops software from scratch. AIDA takes a software requirements specification and uses reasoning over a semantic knowledge graph to interpret the requirements, then designs and writes software to satisfy them. AIDA uses both declarative and procedural knowledge in the core domains of data, algorithms, and code, plus some general knowledge. The reasoning codebase uses this knowledge to identify needed components, then designs and builds the necessary information structures around them that become the software. These structures, the motivating requirements, and the resulting source code itself are all new knowledge that is added to the knowledge graph, becoming available for future reasoning. In this way, AIDA also learns as she writes code and becomes more efficient when writing subsequent code.